GLMs and Variance

How much of the variation in the outcome variable is explained by the predictor variables?

Multiple Regression Makes a Composite Variable

  • Linear combination of predictor variables that is maximally correlated with outcome variable
  • How well can you predict the outcome by the set of predictor variables?
    • correlation of \(y\) with \(\hat{y}\)
    • \(R^{2}\) = squared correlation coefficient of \(y\) with \(\hat{y}\)
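
The composite-variable idea can be sketched with a small example. This is a minimal illustration rather than the analysis used in these slides: it assumes R's built-in `mtcars` data, and the names `fm`, `b`, and `composite` are hypothetical. It builds the linear combination by hand from the fitted coefficients and checks that its squared correlation with the outcome matches the reported \(R^{2}\).

```r
# Illustrative sketch using the built-in mtcars data (an assumption; any
# data set with a numeric outcome and two numeric predictors would do).
fm <- lm(mpg ~ wt + hp, data = mtcars)
b <- coef(fm)

# The composite variable: intercept plus the weighted sum of the predictors
composite <- b["(Intercept)"] + b["wt"] * mtcars$wt + b["hp"] * mtcars$hp

# The hand-built composite should be identical to the fitted values (y-hat)
all.equal(unname(composite), unname(fitted(fm)))

# Squared correlation of y with y-hat should equal the model's R-squared
r2_cor <- cor(mtcars$mpg, composite)^2
r2_lm <- summary(fm)$r.squared
all.equal(r2_cor, r2_lm)
```

Both `all.equal()` calls are expected to return `TRUE`, which is what "multiple regression makes a composite variable" means in practice.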

Milk Energy

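library(tidyverse)  # provides %>%, select(), drop_na(), rename(), mutate()
library(readxl)     # provides read_excel()

# Load the milk data, keep complete cases, and add log-transformed mass
M <- read_excel("../data/Milk.xlsx", na = "NA") %>%
  select(species, kcal.per.g, mass, neocortex.perc) %>%
  drop_na() %>%
  rename(Species = species,
         Milk_Energy = kcal.per.g,
         Mass = mass,
         Neocortex = neocortex.perc) %>%
  mutate(log_Mass = log(Mass))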

Visualizing data

Multivariate model

fm_Multi <- lm(Milk_Energy ~ Neocortex + log_Mass, data = M)
summary(fm_Multi)
## 
## Call:
## lm(formula = Milk_Energy ~ Neocortex + log_Mass, data = M)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.250574 -0.039212  0.000633  0.072997  0.201985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -1.085254   0.515281  -2.106  0.05372 . 
## Neocortex    0.027931   0.008015   3.485  0.00364 **
## log_Mass    -0.096402   0.024749  -3.895  0.00162 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1265 on 14 degrees of freedom
## Multiple R-squared:  0.5317, Adjusted R-squared:  0.4648 
## F-statistic: 7.948 on 2 and 14 DF,  p-value: 0.004939

Visualizing multiple regression

\(R^{2}\): Multiple Regression Makes a Composite Variable

y_hat <- predict(fm_Multi)  # fitted values from the multiple regression
cor(y_hat, M$Milk_Energy)^2
## [1] 0.5317037
summary(fm_Multi)$r.squared
## [1] 0.5317037

“Analysis of Variance”

The total variability in \(y\) is split into two parts:

  1. Part explained by group membership
  2. Part that remains unexplained (“error” or “residual”)

The \(F\)-statistic is the ratio of the two, each divided by its degrees of freedom (a ratio of mean squares).

Visualizing ANOVA

\[F = \frac{\mbox{Between Group Variation}}{\mbox{Within Group Variation}}\]
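
The between/within ratio can be computed by hand as a check on the idea. The sketch below is illustrative only: it assumes R's built-in `PlantGrowth` data (30 plant weights in 3 groups) rather than the data used in these slides, and all variable names are hypothetical.

```r
# Illustrative sketch using the built-in PlantGrowth data (an assumption)
y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean <- mean(y)
group_mean <- ave(y, g)   # each observation's own group mean

# Partition the total sum of squares into between- and within-group parts
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((y - group_mean)^2)

# Divide each part by its degrees of freedom to get mean squares
df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)
F_by_hand <- (ss_between / df_between) / (ss_within / df_within)

# Compare with the F value reported by anova()
F_anova <- anova(lm(weight ~ group, data = PlantGrowth))$`F value`[1]
all.equal(F_by_hand, F_anova)
```

The two sums of squares add up to the total sum of squares around the grand mean, which is why the same pieces can later be reused to compute \(R^{2}\).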

Parts of an ANOVA table

## Analysis of Variance Table
## 
## Response: Shift
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## Treatment  2 7.2245  3.6122  7.2894 0.004472 **
## Residuals 19 9.4153  0.4955                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Sum Sq: Variability accounted for by that part of the ANOVA
  • Mean Sq: Sum Sq / Df
  • F value: Mean Sq Treatment / Mean Sq Residual
  • Pr(>F): P-value for the F-test of that variable
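
As a worked check, the columns of the table can be rebuilt from its `Sum Sq` and `Df` entries alone. This is a sketch under the assumption that the rounded values printed in the table are accurate enough; the variable names are hypothetical, and small differences in the last digit come from that rounding.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss <- c(Treatment = 7.2245, Residuals = 9.4153)   # Sum Sq
df <- c(Treatment = 2, Residuals = 19)            # Df

ms <- ss / df                                     # Mean Sq = Sum Sq / Df
F_value <- ms[["Treatment"]] / ms[["Residuals"]]  # F value

# Pr(>F): upper-tail probability of an F distribution with (2, 19) df
p_value <- pf(F_value, df1 = df[["Treatment"]], df2 = df[["Residuals"]],
              lower.tail = FALSE)

round(ms, 4)
round(F_value, 3)
signif(p_value, 3)
```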

How much variation is explained by group membership: \(R^{2}\)?

\[R^{2} = \frac{\mbox{Variation accounted for by group membership}}{\mbox{Total variation}}\]

\[R^{2} = \frac{\mbox{Sum Sq Group}}{\mbox{(Sum Sq Group + Sum Sq Residuals)}}\]

anova(fm_lm)$`Sum Sq`[1]/sum(anova(fm_lm)$`Sum Sq`)
## [1] 0.4341684
7.224492/(7.224492 + 9.415345)
## [1] 0.4341684

summary(fm_lm)
## 
## Call:
## lm(formula = Shift ~ Treatment, data = JL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27857 -0.36125  0.03857  0.61147  1.06571 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   -0.30875    0.24888  -1.241  0.22988   
## Treatmenteyes -1.24268    0.36433  -3.411  0.00293 **
## Treatmentknee -0.02696    0.36433  -0.074  0.94178   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7039 on 19 degrees of freedom
## Multiple R-squared:  0.4342, Adjusted R-squared:  0.3746 
## F-statistic: 7.289 on 2 and 19 DF,  p-value: 0.004472
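
The last two lines of this output are tied together by the same variance partition. As a hedged sketch (variable names are hypothetical, and the inputs are copied from the output shown here), the \(F\)-statistic and the adjusted \(R^{2}\) can both be recovered from \(R^{2}\) and the degrees of freedom:

```r
# Values copied from the output above
r2  <- 0.4341684   # Multiple R-squared (unrounded, from the ratio of sums of squares)
df1 <- 2           # numerator df: number of groups - 1
df2 <- 19          # residual df

# F-statistic expressed in terms of R-squared
F_from_r2 <- (r2 / df1) / ((1 - r2) / df2)

# Adjusted R-squared: R-squared penalized for the number of estimated parameters
n <- df1 + df2 + 1   # total number of observations
adj_r2 <- 1 - (1 - r2) * (n - 1) / df2

round(F_from_r2, 3)
round(adj_r2, 4)
```

The results should agree with the `F-statistic` and `Adjusted R-squared` lines of the summary, up to rounding.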